For convenience, threadIdx is a 3-component vector, so that threads can be identified using a one-
dimensional, two-dimensional, or three-dimensional thread index, forming a one-dimensional, two-
dimensional, or three-dimensional block of threads, called a thread block. This provides a natural way
to invoke computation across the elements in a domain such as a vector, matrix, or volume.
The index of a thread and its thread ID relate to each other in a straightforward way: For a one-
dimensional block, they are the same; for a two-dimensional block of size (Dx, Dy), the thread ID of
a thread of index (x, y) is (x + y Dx); for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a
thread of index (x, y, z) is (x + y Dx + z Dx Dy).
As an example, the following code adds two matrices A and B of size NxN and stores the result into
matrix C.
∕∕ Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
float C[N][N])
{
int i = threadIdx.x;
int j = threadIdx.y;
C[i][j] = A[i][j] + B[i][j];
}
int main()
{
...
∕∕ Kernel invocation with one block of N * N * 1 threads
int numBlocks = 1;
dim3 threadsPerBlock(N, N);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
...
}
There is a limit to the number of threads per block, since all threads of a block are expected to reside
on the same streaming multiprocessor core and must share the limited memory resources of that
core. On current GPUs, a thread block may contain up to 1024 threads.
However, a kernel can be executed by multiple equally-shaped thread blocks, so that the total number
of threads is equal to the number of threads per block times the number of blocks.
Blocks are organized into a one-dimensional, two-dimensional, or three-dimensional grid of thread
blocks as illustrated by Figure 4. The number of thread blocks in a grid is usually dictated by the size
of the data being processed, which typically exceeds the number of processors in the system.
The number of threads per block and the number of blocks per grid specified in the <<<...>>> syntax
can be of type int or dim3. Two-dimensional blocks or grids can be specified as in the example above.
Each block within the grid can be identified by a one-dimensional, two-dimensional, or three-
dimensional unique index accessible within the kernel through the built-in blockIdx variable. The
dimension of the thread block is accessible within the kernel through the built-in blockDim variable.
Extending the previous MatAdd() example to handle multiple blocks, the code becomes as follows.
∕∕ Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
float C[N][N])
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
if (i < N && j < N)
C[i][j] = A[i][j] + B[i][j];
}
int main()
{
...
∕∕ Kernel invocation
dim3 threadsPerBlock(16, 16);
dim3 numBlocks(N ∕ threadsPerBlock.x, N ∕ threadsPerBlock.y);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
...
}
A thread block size of 16x16 (256 threads), although arbitrary in this case, is a common choice. The
grid is created with enough blocks to have one thread per matrix element as before. For simplicity,
this example assumes that the number of threads per grid in each dimension is evenly divisible by the
number of threads per block in that dimension, although that need not be the case.
Thread blocks are required to execute independently. It must be possible to execute blocks in any
order, in parallel or in series. This independence requirement allows thread blocks to be scheduled in
any order and across any number of cores as illustrated by Figure 3, enabling programmers to write
code that scales with the number of cores.
Threads within a block can cooperate by sharing data through some shared memory and by synchroniz-
ing their execution to coordinate memory accesses. More precisely, one can specify synchronization
points in the kernel by calling the __syncthreads() intrinsic function; __syncthreads() acts as a
barrier at which all threads in the block must wait before any is allowed to proceed. Shared Memory
gives an example of using shared memory. In addition to __syncthreads(), the Cooperative Groups
API provides a rich set of thread-synchronization primitives.
For efficient cooperation, shared memory is expected to be a low-latency memory near each processor
core (much like an L1 cache) and __syncthreads() is expected to be lightweight.
